Looking over previous homeworks to identify different forecasting approaches

# Import required R libraries
library(fpp3)
#library(tidyverse)
library(readxl)

This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday October 31. I will accept late submissions with a penalty until the meetup after that when we review some projects.

Part A – ATM Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable Cash is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

# Read in data
atm_data_raw <- read_excel("data/ATM624Data.xlsx")

# Initial output to see data
head(atm_data_raw)
## # A tibble: 6 × 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90
summary(atm_data_raw)
##       DATE           ATM                 Cash        
##  Min.   :39934   Length:1474        Min.   :    0.0  
##  1st Qu.:40026   Class :character   1st Qu.:    0.5  
##  Median :40118   Mode  :character   Median :   73.0  
##  Mean   :40118                      Mean   :  155.6  
##  3rd Qu.:40210                      3rd Qu.:  114.0  
##  Max.   :40312                      Max.   :10919.8  
##                                     NA's   :19
dim(atm_data_raw)
## [1] 1474    3
# 1474    3
#Cash in hundreds

# Define as tsibble
atm_data_ts <- atm_data_raw %>%
  as_tsibble(index = DATE, key = ATM)
library(seasonal)
atm1_data_ts <- atm_data_ts %>%
  filter(ATM == 'ATM1')

summary(atm1_data_ts)
##       DATE           ATM                 Cash       
##  Min.   :39934   Length:365         Min.   :  1.00  
##  1st Qu.:40025   Class :character   1st Qu.: 73.00  
##  Median :40116   Mode  :character   Median : 91.00  
##  Mean   :40116                      Mean   : 83.89  
##  3rd Qu.:40207                      3rd Qu.:108.00  
##  Max.   :40298                      Max.   :180.00  
##                                     NA's   :3
atm1_data_ts
## # A tsibble: 365 x 3 [1]
## # Key:       ATM [1]
##     DATE ATM    Cash
##    <dbl> <chr> <dbl>
##  1 39934 ATM1     96
##  2 39935 ATM1     82
##  3 39936 ATM1     85
##  4 39937 ATM1     90
##  5 39938 ATM1     99
##  6 39939 ATM1     88
##  7 39940 ATM1      8
##  8 39941 ATM1    104
##  9 39942 ATM1     87
## 10 39943 ATM1     93
## # … with 355 more rows
atm1_data_ts %>%
  autoplot(Cash)

# Calculate median value for ATM1
median <- median(atm1_data_ts$Cash, na.rm=TRUE)

# Set NAs to median
atm1_data_ts$Cash[is.na(atm1_data_ts$Cash)] <- median

summary(atm1_data_ts)
##       DATE           ATM                 Cash       
##  Min.   :39934   Length:365         Min.   :  1.00  
##  1st Qu.:40025   Class :character   1st Qu.: 73.00  
##  Median :40116   Mode  :character   Median : 91.00  
##  Mean   :40116                      Mean   : 83.95  
##  3rd Qu.:40207                      3rd Qu.:108.00  
##  Max.   :40298                      Max.   :180.00
head(atm1_data_ts)
## # A tsibble: 6 x 3 [1]
## # Key:       ATM [1]
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39935 ATM1     82
## 3 39936 ATM1     85
## 4 39937 ATM1     90
## 5 39938 ATM1     99
## 6 39939 ATM1     88
atm1_data_ts <- atm1_data_ts %>%
  mutate(DATE = as_date(DATE)) %>%
  update_tsibble(index = DATE)

atm1_data_ts
## # A tsibble: 365 x 3 [1D]
## # Key:       ATM [1]
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2079-05-03 ATM1     96
##  2 2079-05-04 ATM1     82
##  3 2079-05-05 ATM1     85
##  4 2079-05-06 ATM1     90
##  5 2079-05-07 ATM1     99
##  6 2079-05-08 ATM1     88
##  7 2079-05-09 ATM1      8
##  8 2079-05-10 ATM1    104
##  9 2079-05-11 ATM1     87
## 10 2079-05-12 ATM1     93
## # … with 355 more rows
# From: https://stats.stackexchange.com/questions/494013/control-the-period-for-daily-time-series-in-tsibbles
atm1_dcmp <- atm1_data_ts %>%
  model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))) 

atm1_dcmp %>% components() %>% autoplot()

components(atm1_dcmp) %>%
  as_tsibble() %>%
  autoplot(Cash, colour = "gray") +
  geom_line(aes(y=season_adjust), colour = "#0072B2") +
  labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")

# from textbook: If the seasonal component is removed from the original data, the resulting values are the “seasonally adjusted” data.
# Create training set (365 total, perhaps assume 1 year of data)
# 292 days, yes, using the generating dates based on the integer values


train_atm1 <- atm1_data_ts %>% 
  filter_index("2079-05-03" ~ "2080-02-20")

train_atm1
## # A tsibble: 294 x 3 [1D]
## # Key:       ATM [1]
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2079-05-03 ATM1     96
##  2 2079-05-04 ATM1     82
##  3 2079-05-05 ATM1     85
##  4 2079-05-06 ATM1     90
##  5 2079-05-07 ATM1     99
##  6 2079-05-08 ATM1     88
##  7 2079-05-09 ATM1      8
##  8 2079-05-10 ATM1    104
##  9 2079-05-11 ATM1     87
## 10 2079-05-12 ATM1     93
## # … with 284 more rows
# Fit the models
fit_atm1 <- train_atm1 %>%
  model(
    Naive = NAIVE(Cash),
    `Seasonal naive` = SNAIVE(Cash),
    `Random walk` = RW(Cash ~ drift())
  )

fit_atm1_snaive <- train_atm1 %>%
  model(SNAIVE(Cash))

fit_atm1 %>% accuracy()
## # A tibble: 3 × 11
##   ATM   .model         .type       ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr> <chr>          <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ATM1  Naive          Trai… 9.90e- 2  49.9  37.5 -84.8  120.  2.05  1.74 -0.360
## 2 ATM1  Seasonal naive Trai… 5.40e- 1  28.6  18.3 -82.8  103.  1     1     0.145
## 3 ATM1  Random walk    Trai… 9.64e-16  49.9  37.5 -85.1  120.  2.05  1.74 -0.360
# Check the residuals of Seasonal Naive
fit_atm1_snaive %>% gg_tsresiduals()

# Generate forecasts for 73 days
fc_atm1 <- fit_atm1 %>% forecast(h = "71 days")

fc_atm1 %>% accuracy(atm1_data_ts)
## # A tibble: 3 × 11
##   .model         ATM   .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>          <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 Naive          ATM1  Test  -44.8  56.7  45.7 -602.  602.  2.49  1.98 -0.0619
## 2 Random walk    ATM1  Test  -48.4  59.7  49.2 -619.  620.  2.68  2.08 -0.0425
## 3 Seasonal naive ATM1  Test  -20.3  60.9  49.7 -483.  511.  2.71  2.13  0.281
# Plot forecasts against actual values
fc_atm1 %>%
  autoplot(train_atm1, level = NULL) +
  autolayer(
    filter_index(atm1_data_ts, "2080-02-21" ~ "2080-05-01"),
    colour = "black"
  ) +
  labs(
    y = "Cash (in hundreds)",
    title = "Forecasts for ATM1 Withdrawals"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

# Result: Not good, SNAIVE at least attempts the seasonal nature
# ETS

fit_atm1_ets <- train_atm1 %>%
  model(
    add = ETS(Cash ~ error("A") + trend("N") + season("A")),
    mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
    damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
  )

# Forecast for 73 days
fc_atm1_ets <- fit_atm1_ets %>% forecast(h = "71 days")

fc_atm1_ets %>%
  autoplot(filter_index(train_atm1, "2079-12-01" ~ "2080-05-01"), level = NULL) +
  autolayer(
    filter_index(atm1_data_ts, "2079-12-01" ~ "2080-05-01"),
    colour = "black"
  ) +
  labs(y="Billions $USD", title="GDP: China") +
  guides(colour = guide_legend(title = "Forecast"))

# Forecast over the test set
# Measure the accuracy
# Forecast beyond the data
# NAIVE
# SNAIVE
# drift()
# ETS
# ARIMA
atm2_data_ts <- atm_data_ts %>%
  filter(ATM == 'ATM2')

atm2_data_ts %>%  autoplot(Cash)

# Calculate median value for ATM2
median <- median(atm2_data_ts$Cash, na.rm=TRUE)

# Set NAs to median
atm2_data_ts$Cash[is.na(atm2_data_ts$Cash)] <- median

summary(atm2_data_ts)
##       DATE           ATM                 Cash      
##  Min.   :39934   Length:365         Min.   :  0.0  
##  1st Qu.:40025   Class :character   1st Qu.: 26.0  
##  Median :40116   Mode  :character   Median : 67.0  
##  Mean   :40116                      Mean   : 62.6  
##  3rd Qu.:40207                      3rd Qu.: 93.0  
##  Max.   :40298                      Max.   :147.0
atm2_data_ts %>%
  model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))
        ) %>% 
  components() %>% autoplot()

atm3_data_ts <- atm_data_ts %>%
  filter(ATM == 'ATM3', Cash > 0)

atm3_data_ts %>% autoplot(Cash)

summary(atm3_data_ts)
##       DATE           ATM                 Cash      
##  Min.   :40296   Length:3           Min.   :82.00  
##  1st Qu.:40296   Class :character   1st Qu.:83.50  
##  Median :40297   Mode  :character   Median :85.00  
##  Mean   :40297                      Mean   :87.67  
##  3rd Qu.:40298                      3rd Qu.:90.50  
##  Max.   :40298                      Max.   :96.00
# Literally just 3 values above 0
atm4_data_ts <- atm_data_ts %>%
  filter(ATM == 'ATM4')

atm4_data_ts %>% autoplot(Cash)

summary(atm4_data_ts)
##       DATE           ATM                 Cash          
##  Min.   :39934   Length:365         Min.   :    1.563  
##  1st Qu.:40025   Class :character   1st Qu.:  124.334  
##  Median :40116   Mode  :character   Median :  403.839  
##  Mean   :40116                      Mean   :  474.043  
##  3rd Qu.:40207                      3rd Qu.:  704.507  
##  Max.   :40298                      Max.   :10919.762
atm4_data_ts %>%
  model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))
        ) %>% 
  components() %>% autoplot()

Summary output DATE ATM Cash
Min. :39934 Length:1474 Min. : 0.0
1st Qu.:40026 Class :character 1st Qu.: 0.5
Median :40118 Mode :character Median : 73.0
Mean :40118 Mean : 155.6
3rd Qu.:40210 3rd Qu.: 114.0
Max. :40312 Max. :10919.8
NA’s :19

Date 39934 to 40312

379 dates

ATM1, ATM2, ATM3, ATM4, NA

Cash 19 NA’s

Data observations: ATM1: 3 Cash NA values ATM2: 2 Cash NA values ATM3: Only 3 Cash values with something above zero ATM4: Many Cash values with a decimal, but not all, something weird there, also ATM4 appears to have 1 really crazy outlier Final 14 entries are NA, NA (DATE of 40299 and higher are NA, NA)

Dimensions output [1] 1474 3

Part B – Forecasting Power

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

# Read in data
power_data_raw <- read_excel("data/ResidentialCustomerForecastLoad-624.xlsx")

# Change column name to 'Month'
names(power_data_raw)[names(power_data_raw) == 'YYYY-MMM'] <- 'Month'

head(power_data_raw)
## # A tibble: 6 × 3
##   CaseSequence Month        KWH
##          <dbl> <chr>      <dbl>
## 1          733 1998-Jan 6862583
## 2          734 1998-Feb 5838198
## 3          735 1998-Mar 5420658
## 4          736 1998-Apr 5010364
## 5          737 1998-May 4665377
## 6          738 1998-Jun 6467147
summary(power_data_raw)
##   CaseSequence      Month                KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
dim(power_data_raw)
## [1] 192   3
# 192   3
# KWH

power_data_ts <- power_data_raw %>%
  mutate(Month = yearmonth(Month)) %>%
  mutate(KWH = KWH/1e3) %>% # In thousands
  as_tsibble(index = Month)

head(power_data_ts)
## # A tsibble: 6 x 3 [1M]
##   CaseSequence    Month   KWH
##          <dbl>    <mth> <dbl>
## 1          733 1998 Jan 6863.
## 2          734 1998 Feb 5838.
## 3          735 1998 Mar 5421.
## 4          736 1998 Apr 5010.
## 5          737 1998 May 4665.
## 6          738 1998 Jun 6467.
power_data_ts %>%
  autoplot(KWH)

power_data_ts %>%
  filter_index("2010" ~ "2011") %>%
  print()
## # A tsibble: 13 x 3 [1M]
##    CaseSequence    Month   KWH
##           <dbl>    <mth> <dbl>
##  1          877 2010 Jan 9397.
##  2          878 2010 Feb 8391.
##  3          879 2010 Mar 7348.
##  4          880 2010 Apr 5776.
##  5          881 2010 May 4919.
##  6          882 2010 Jun 6696.
##  7          883 2010 Jul  771.
##  8          884 2010 Aug 7923.
##  9          885 2010 Sep 7819.
## 10          886 2010 Oct 5876.
## 11          887 2010 Nov 4801.
## 12          888 2010 Dec 6153.
## 13          889 2011 Jan 8395.

First observations, data is Monthly, so I’d expect a seasonal component. 1 month is missing KWH has 1 NA value (2008 Sep) Considering imputing with median Sep Value Outlier (with very small value in July 2010) Considering imputing with median Jul Value

Part C – Waterflow (optional)

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

#30#